Code
# Package vector names
packages <- c("tidyverse", "knitr", "ape", "geiger", "caper", "phytools") In this hands-one, you will learn basic tools in R for visualizing phylogenies, testing models of character evolution, performing phylogenetic correction of a regression model, and test for the phylogenetic signal of continuous characters. This lab is based in part on one designed by LukeHarmon for a workshop that he and others ran at Ilha Bela, Brazil; the original can be seen here There are many other useful labs in comparative analysis from that workshop that you can peruse at your leisure.
You will need two datasets, that will be provided for you:
A data.frame with species traits – CDR_traits_DIAZ_imputed.csv
A phylogenetic tree – CDR_timeTree.tacted.nexus.tre
For this hands-on, you will need to have a set of R packages to do this lab. Install the following packages:
# Package vector names
packages <- c("tidyverse", "knitr", "ape", "geiger", "caper", "phytools") You can use the function install.packages() to install the packages.
If you don’t want to install the packages one by one, you can use the next command.
# Install packages not yet installed
# get packages already installed
installed_packages <- packages %in% rownames(installed.packages())
# If the packages are installed skip if not install them
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages], dependencies = TRUE)
}This command, will, first, check if you already the packages installed, then if a package is not installed in your computer, will install it.
Load installed packages:
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(ape)
Attaching package: 'ape'
The following object is masked from 'package:dplyr':
where
library(geiger)Loading required package: phytools
Loading required package: maps
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
library(phytools)
library(caper)Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Loading required package: mvtnorm
Set up a working directory. Tell R that this is the directory you will be using, and read in your data:
You can use the function getwd() to get the current working directory.
setwd("path/for/your/directory")You can use the function dir.create() to get create a series of folders within your working directory. For example, if you run dir.create(“Output”) it will create an empty folder–named Output–within your working directory. This folder then can be used to store the results from the lab.
Load the data. Instead of reading files from the computer we will pull the required data directly from the internet.
## Trait data
traitData <- read_csv("https://raw.githubusercontent.com/jesusNPL/Workshop_PCM_HandsOn/refs/heads/main/Data/CDR_traits_DIAZ_imputed.csv") %>%
mutate(species = tipLabel) %>%
column_to_rownames("tipLabel") # we are using the column "tipLabel" as rownamesRows: 582 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): kingdom, phylum, majorGroup, class, order, family_PHY, family_WFO,...
dbl (10): IN_Phylogeny, leaf_area, Nmass, LMA, plant_height, diaspore_mass, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Phylogenetic data
phyData <- read.nexus("https://raw.githubusercontent.com/jesusNPL/Workshop_PCM_HandsOn/refs/heads/main/Data/CDR_timeTree_tacted.nex")
phyData <- force.ultrametric(phyData)***************************************************************
* Note: *
* force.ultrametric does not include a formal method to *
* ultrametricize a tree & should only be used to coerce *
* a phylogeny that fails is.ultrametric due to rounding -- *
* not as a substitute for formal rate-smoothing methods. *
***************************************************************
This operator is, maybe, the most used operator from the {dplyr} package and is used to perform a sequence of operations on a data frame. In other words, the pipe operator simply feeds the results of one operation into the next operation below it.
Previous lines of code will only work if you have set your working directory (WD) and only if you have the folder Data within the WD. You can check the Intro-R lab for more details.
OK. You should be ready to go.
Let’s inspect the data first, to do that we will use the function “glimpse()” of the R package {dplyr}
glimpse(traitData)Rows: 582
Columns: 29
$ kingdom <chr> "Plantae", "Plantae", "Plantae", "Plantae", "Plan…
$ phylum <chr> "Tracheophyta", "Tracheophyta", "Tracheophyta", "…
$ majorGroup <chr> "G", "A", "A", "A", "A", "A", "A", "A", "A", "A",…
$ class <chr> "Ginkgoopsida", "Magnoliopsida", "Magnoliopsida",…
$ order <chr> "Ginkgoales", "Asterales", "Fabales", "Gentianale…
$ family_PHY <chr> "Ginkgoaceae", "Asteraceae", "Fabaceae", "Apocyna…
$ family_WFO <chr> "Ginkgoaceae", "Asteraceae", "Fabaceae", "Apocyna…
$ genus_PHY <chr> "Ginkgo", "Achillea", "Amphicarpaea", "Apocynum",…
$ genus_WFO <chr> "Ginkgo", "Achillea", "Amphicarpaea", "Apocynum",…
$ specificEpithet <chr> "biloba", "millefolium", "bracteata", "androsaemi…
$ sciname_IN_WFO <chr> "Ginkgo biloba", "Achillea millefolium", "Amphica…
$ sciname_IN_Phylogeny <chr> "Ginkgo biloba", "Achillea millefolium", "Amphica…
$ IN_Phylogeny <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ Imputed <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
$ sciname_IN_DIAZ <chr> "Ginkgo biloba", "Achillea aspleniifolia", "Amphi…
$ sciname_IN_BIEN <chr> "Ginkgo biloba", "Achillea millefolium", NA, "Apo…
$ Woodiness <chr> "woody", "non-woody", "non-woody", "non-woody", "…
$ growth_form <chr> "tree", "herbaceous non-graminoid", "climber", "h…
$ leaf_type <chr> "broadleaved", "broadleaved", "broadleaved", "bro…
$ leaf_area <dbl> 1336.0000, 2701.6550, 1459.3500, 7156.8200, 2598.…
$ Nmass <dbl> 20.30000, 16.97002, 30.41157, 17.25568, 17.24000,…
$ LMA <dbl> 92.28989, 71.61548, 26.57855, 51.59986, 97.94319,…
$ plant_height <dbl> 32.8258460, 1.1378405, 0.8581601, 0.7358208, 0.95…
$ diaspore_mass <dbl> 1083.930700, 0.100000, 30.792000, 0.228900, 0.951…
$ SSD_observed <dbl> 0.4500000, 0.2966251, 0.2515797, 0.3535481, 0.285…
$ LDMC <dbl> 0.3493534, 0.2923324, 0.2616435, 0.3345541, 0.281…
$ SSD_imputed <dbl> 0.3121684, 0.2493996, 0.2348250, 0.2711764, 0.245…
$ SSD_combined <dbl> 0.4500000, 0.2530428, 0.2436388, 0.2711764, 0.249…
$ species <chr> "Ginkgo_biloba", "Achillea_millefolium", "Amphica…
Let’s check if our trait data contain the same species as the phylogeny
tmp <- name.check(phy = phyData, data = traitData)
# print the results
tmp$tree_not_data
[1] "Andropogon_gerardii" "Anemone_patens"
[3] "Arabis_divaricarpa" "Capsellabursa-pastoris"
[5] "Colladonia_sp." "Demia_testudo"
[7] "Desmodium_glutinosum" "Echinacea_serotina"
[9] "Eupatorium_maculatum" "Festuca_guestphalica"
[11] "Galium_asperellum" "Galium_trifolium"
[13] "Gnaphalium_purpureum" "Helianthemum_bicknellii"
[15] "Lithospermum_caroliniense" "Miscanthus_giganteus"
[17] "Panicum_perlongum" "Parthenocissus_quinquefolia"
[19] "Petalostemon_candidus" "Petalostemon_purpureus"
[21] "Pilosella_longipila" "Saxifraga_pensylvanica"
[23] "Symphyotrichumnovae-angliae" "Trientalis_borealis"
$data_not_tree
[1] "Capsella_bursa-pastoris" "Deamia_testudo"
[3] "Galium_asprellum" "Galium_bifolium"
[5] "Ginkgo_biloba" "Heptaptera_sp."
[7] "Miscanthus_longiberbis" "Symphyotrichum_novae-angliae"
It indicates that several species are not present either in the trait data or phylogeny, so let’s drop this species from the phylogeny and remove from the trait data. To do that we will use the function drop.tip() of the package {ape}
phyData <- drop.tip(phy = phyData, tip = tmp$tree_not_data)
traitData <- traitData %>%
filter(species %in% phyData$tip.label)We can double check if our data match after dropping the missing species
name.check(phy = phyData, data = traitData)[1] "OK"
Another option is to use the function “comparative.data” of the R package {caper}
## Create comparative.data object for further analyses
compCDR <- comparative.data(
phy = phyData,
data = traitData,
names.col = "species",
vcv.dim = TRUE,
warn.dropped = FALSE
)Anyway, now it seems that we are ready to go!
Let’s start by looking at the phylogeny of these birds and learning a bit about how to work with trees in R.
What does your tree look like?
plot(phyData)Answer: Whoa. That’s ugly. Let’s clean it up.
plot.phylo(phyData,
no.margin = TRUE,
cex = 0.5)Better. You can mess around with tree plotting functions in plot.phylo() as much as you’d like. Try this for example:
plot.phylo(phyData,
type = "fan",
no.margin = TRUE,
cex = 0.3)Much much better.
It may be useful to understand how trees are encoded in R. Typing in just the name of the tree file like this:
phyData
Phylogenetic tree with 574 tips and 573 internal nodes.
Tip labels:
Juniperus_communis, Juniperus_sp., Juniperus_virginiana, Picea_sp., Larix_laricina, Pinus_strobus, ...
Rooted; includes branch lengths.
will give you basic information about the phylogeny: the number of tips and nodes; what the tips are called; whether the tree is rooted; and if it has branch lengths.
str(phyData)List of 4
$ edge : int [1:1146, 1:2] 575 576 577 577 578 578 576 579 580 580 ...
$ edge.length: num [1:1146] 1.46e+02 1.22e+02 5.04 6.54e-03 5.03 ...
$ Nnode : int 573
$ tip.label : chr [1:574] "Juniperus_communis" "Juniperus_sp." "Juniperus_virginiana" "Picea_sp." ...
- attr(*, "class")= chr "phylo"
- attr(*, "order")= chr "cladewise"
- attr(*, "RSS")= num 2.77e-07
will tell you more about tree structure. Trees consist of tips connected by edges (AKA branches)
phyData$tip.labelgives you a list of all your terminal taxa, which are by default numbered 1-n, where n is the number of taxa.
phyData$Nnodegives you the number of nodes. This is a fully bifurcating rooted tree, so it has 1 fewer node than the number of taxa.
phyData$edgeThis tells you the beginning and ending node for all edges.
Put that all together with the following lines
plot.phylo(phyData,
type = "fan",
no.margin = TRUE,
cex = 0.7,
label.offset = 0.1,
show.tip.label = FALSE)
nodelabels(cex = 0.5)
tiplabels(cex = 0.5)There are many ways to manipulate trees in R using {ape}, {Phytools}, and other packages. This just gives you a bare-bones introduction.
Let’s ask some questions using the trait data that were measured for these birds. First, explore the data in the “traitData” matrix. Here are some options for visualizing data matrices:
traitData %>%
head() # this will show you the first few rows of your data matrix and its header
traitData %>%
dimnames() # this will show you the row and column headers for your matrix
traitData %>%
View() # this will let you visualize the entire matrixleaf area is one of your variables found in the trait dataset. Let’s isolate it so we can work with it easily:
leaf_area <- traitData[, "leaf_area"]
names(leaf_area) <- rownames(traitData)
# data vectors have to be labelled with tip names for the associated tree.
# This is how to do that. It is good practice to check the distribution of your data before doing downstream analysis.
hist(leaf_area)What about if we log scale the HWI?
hist(log(leaf_area))Does it look different/similar?
In the lecture, we talked about one model of character evolution, called a Brownian Motion model. This model assumes that a trait evolves from a starting state (z0) according to a random walk with the variance specified by the rate parameter \(\sigma^{2}\) (sigma-squared). In short, Brownian motion describes a process in which tip states are modeled under the assumption of a multivariate normal distribution. On a phylogeny, the multivariate mean of tip states is equal to the root state estimate, and variance accumulates linearly through time.
What does Brownian Motion evolution of leaf area look like?
brownianModel <- fitContinuous(phy = phyData,
dat = log(leaf_area))
brownianModel # this will show you the fit statistics and parameter valuesGEIGER-fitted comparative model of continuous data
fitted 'BM' model parameters:
sigsq = 35.449783
z0 = 6.127122
model summary:
log-likelihood = -2534.051403
AIC = 5072.102805
AICc = 5072.123821
free parameters = 2
Convergence diagnostics:
optimization iterations = 100
failed iterations = 0
number of iterations with same best fit = 100
frequency of best fit = 1.000
object summary:
'lik' -- likelihood function
'bnd' -- bounds for likelihood search
'res' -- optimization iteration summary
'opt' -- maximum likelihood parameter estimates
Answer: The estimated ancestral state (z0) under Brownian Motion evolutionary model suggests that leaf area was ~6.127122. This value is similar to the current mean trait log-value (\(\mu\) = 6.87). This model also suggests that leaf area is evolving at a rate \(\sigma^{2}\) = 35.45.
mean(log(leaf_area))[1] 6.868803
Here, you can see the estimates for ancestral state (z0), and the rate parameter (\(\sigma^{2}\)), as well as some measures of model fit. The fit of the model is determined using maximum likelihood, and expressed as a log likelihood (lnL). The higher the lnL, the more probable the data given the model. However, when comparing different models, we can’t use the lnL, because it does not account for the difference in the number of parameters among models. Models with more parameters will always fit better, but do they fit significantly better? For example an OU model has 4 parameters (alpha [\(\alpha\)], theta [\(\theta\)], z0, and sigma-squared [\(\sigma^{2}\)]), so it should fit better than a BM model, which includes only z0 and \(\sigma^{2}\). To account for this, statisticians have developed another measure of fit called the AIC (Akaike Information Criterion): AIC = (2xN)-2xlnL, where N is the number of parameters. This penalizes the likelihood score for adding parameters. When selecting among a set of models, the one with the lowest AIC is preferred. We will use this information later on in this lab.
In addition to assessing model fit, we can use the Brownian Motion model to reconstruct ancestral states of a character on a tree. To visualize what BM evolution of this trait looks like on a tree. The contMap() command in {phytools} estimates the ancestral states and plots them on a tree.
## Calculate number of trait shifts
obj <- contMap(phyData,
log(leaf_area),
fsize = 0.1,
lwd = 2,
type = "fan",
plot = FALSE)
# change colors
obj <- setMap(obj,
c("white", "#FFFFB2", "#FECC5C", "#FD8D3C", "#E31A1C"))
# Plot the results
plot(obj,
fsize = c(0.2, 0.8),
leg.txt = "Leaf area")Describe the evolution of Hand-wing index on this tree. How many times have extremely high and extremely low Hand-wing index evolved on this tree?
Answer: This phylogenetic tree contains 249 species, and the mapped values range between 1.96 and 3.57. By visual inspection, overall, the HWI tended to evolve more low values than high values; indeed, HWI seems to evolve high values in about 5-7 branches. The species with higher HWI index are Berlepschia rikeri, Geositta antarctica, Geositta isabellina, Geositta saxicolina, and Geositta maritima.
plot(obj,
fsize = c(0.6, 0.8),
type = "fan",
leg.txt = "Hand-Wing Index")What does this say about our ability to test hypotheses about the evolution of leaf area?
Answer: Although the Brownian Motion model is simple, it helps us understand how fast or slow a specific attribute is evolving and to identify branches with different evolutionary regimes.
Let’s go ahead and test some hypotheses. Plant height is another trait in your data matrix. Let’s assess whether there is a correlation between leaf area and plant size?
plant_height <- traitData[, "plant_height"]
names(plant_height) <- rownames(traitData)Let’s see if range size follow a normal distribution
hist(log(plant_height))Let’s look at a plot of range size as a function of Hand-wing index.
traitData %>%
ggplot(aes(x = log(plant_height), y = log(leaf_area))) +
geom_point(alpha = 0.5, color = "darkgray", size = 3) +
labs(x = "log(Plant height)", y = "log(Leaf area)")Hm. looks promising.
How would you describe the relationship between these two variables?
Answer: At first glance, it seems that there is a positive association between leaf area and plant height.
Let’s be more quantitative in describing that relationship with a linear model.
lmModel <- lm(log(leaf_area) ~ log(plant_height))
summary(lmModel)
Call:
lm(formula = log(leaf_area) ~ log(plant_height))
Residuals:
Min 1Q Median 3Q Max
-4.7333 -0.3098 0.1543 0.5337 3.0931
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.89287 0.04276 161.19 <2e-16 ***
log(plant_height) 0.38352 0.03657 10.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.023 on 572 degrees of freedom
Multiple R-squared: 0.1613, Adjusted R-squared: 0.1598
F-statistic: 110 on 1 and 572 DF, p-value: < 2.2e-16
traitData %>%
ggplot(aes(x = log(plant_height), y = log(leaf_area))) +
geom_point(alpha = 0.5, color = "darkgray", size = 3) +
labs(x = "log(Plant height)", y = "log(Leaf area)") +
geom_smooth(method = "lm")`geom_smooth()` using formula = 'y ~ x'
The coefficients table from the summary() command shows the slope and intercept for the linear model describing leaf area as a function of plant height. Each line shows the estimated coefficient (Estimate), the standard error (Std. Error) of that estimate, as well as a t-statistic and associated p-value, testing whether those parameters are equal to 0. The Multiple R-squared is an estimate of how much variance in the response variable can be explained by the predictor variable.
Write the linear model for this relationship. Are the parameters significantly different from 0?
Answer: LA(y) ~ \(\alpha\) = 6.89287 + \(\beta\) = 0.38352(log(plant height)) + \(\epsilon\)
The coefficients (\(\alpha\) = Intercept and \(\beta\) = Slope) of this model are different from zero, that is, the model suggest that there is a positive association between the two variables (Leaf area ~ plant height).
What is the R^2 value for this data?
Answer: Despite the coefficients of the model being different from zero, the adjusted R^2 = 0.16 meaning that plant height explains 16% of the variation of leaf area.
How do you feel about that?
Nice. But, we have not considered the fact that these plants in Cedar Creek are related to each other. As such, they may share similar trait values simply due to the fact that their ancestors had large leaf area and heigth or the reverse. In other words, we need to account for non-independence of residuals due to phylogeny. One way to do that is to use phylogenetic-generalized-least-squares regression (PGLS).
pglsModel <- pgls(log(leaf_area) ~ log(plant_height), lambda = "ML", data = compCDR)
summary(pglsModel)
Call:
pgls(formula = log(leaf_area) ~ log(plant_height), data = compCDR,
lambda = "ML")
Residuals:
Min 1Q Median 3Q Max
-0.33751 -0.03045 0.01565 0.05670 0.26769
Branch length transformations:
kappa [Fix] : 1.000
lambda [ ML] : 0.575
lower bound : 0.000, p = 4.4409e-16
upper bound : 1.000, p = < 2.22e-16
95.0% CI : (0.398, 0.721)
delta [Fix] : 1.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.332669 0.616152 8.6548 < 2.2e-16 ***
log(plant_height) 0.480629 0.054787 8.7727 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08438 on 376 degrees of freedom
Multiple R-squared: 0.1699, Adjusted R-squared: 0.1677
F-statistic: 76.96 on 1 and 376 DF, p-value: < 2.2e-16
summary(pglsModel)
Call:
pgls(formula = log(leaf_area) ~ log(plant_height), data = compCDR,
lambda = "ML")
Residuals:
Min 1Q Median 3Q Max
-0.33751 -0.03045 0.01565 0.05670 0.26769
Branch length transformations:
kappa [Fix] : 1.000
lambda [ ML] : 0.575
lower bound : 0.000, p = 4.4409e-16
upper bound : 1.000, p = < 2.22e-16
95.0% CI : (0.398, 0.721)
delta [Fix] : 1.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.332669 0.616152 8.6548 < 2.2e-16 ***
log(plant_height) 0.480629 0.054787 8.7727 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08438 on 376 degrees of freedom
Multiple R-squared: 0.1699, Adjusted R-squared: 0.1677
F-statistic: 76.96 on 1 and 376 DF, p-value: < 2.2e-16
coef(pglsModel) (Intercept) log(plant_height)
5.3326689 0.4806286
traitData %>%
ggplot(aes(x = log(plant_height), y = log(leaf_area))) +
geom_point(alpha = 0.5, color = "darkgray", size = 3) +
labs(x = "log(Plant height)", y = "log(Leaf area)") +
geom_abline(intercept = coef(pglsModel)[1], slope = coef(pglsModel)[2],
color = "red", linewidth = 1.5)# will plot the pgls regression line on your biplot.## Loglik of the PGLS model
logLik(pglsModel)'log Lik.' -554.4899 (df=2)
## Estimate null model or intercept-only model
y <- as.numeric(fitted(pglsModel) + resid(pglsModel))
nullModel <- lm(y ~ 1)
## loglik null model
logLik(nullModel)'log Lik.' -613.6878 (df=2)
## Is higher than a simple null model
logLik(pglsModel) > logLik(nullModel) [,1]
[1,] TRUE
## Is higher than the lm model
logLik(pglsModel) > logLik(lmModel) [,1]
[1,] TRUE
Let’s use Information Criteria and see which model performed the best.
AIC(lmModel)[1] 1659.054
AIC(nullModel)[1] 1231.376
AIC(pglsModel)[1] 1112.98
Or we can make our model comparisons more interpretable.
mods <- c(AIC(lmModel), AIC(nullModel), AIC(pglsModel))
names(mods) <- c("OLS", "Null", "PGLS")
aics <- aicw(mods)
names(aics)[1] <- "AIC"
aics$lnL <- c(logLik(lmModel), logLik(nullModel), logLik(pglsModel))
kable(aics)| AIC | delta | w | lnL | |
|---|---|---|---|---|
| OLS | 1659.054 | 546.0736 | 0 | -826.5268 |
| Null | 1231.376 | 118.3957 | 0 | -613.6878 |
| PGLS | 1112.980 | 0.0000 | 1 | -554.4899 |
Seems that PGLS is outperforming the OLS, let’s plot the results:
traitData %>%
ggplot(aes(x = log(plant_height), y = log(leaf_area))) +
geom_point(alpha = 0.5, color = "darkgray", size = 3) +
labs(x = "log(Plant height)", y = "log(Leaf area)") +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.5) + # OLS slope
geom_abline(intercept = coef(pglsModel)[1], slope = coef(pglsModel)[2], # PGLS slope
color = "red", linewidth = 1.5) `geom_smooth()` using formula = 'y ~ x'
# will plot the pgls regression line on your biplot.How do you feel about that?
Phylogenetic signal is the tendency of related species to resemble each other more than species drawn at random from the same tree.
Blomberg’s K compares the variance of PICs to what we would expect under a Brownian motion (BM) model of evolution. K = 1 means that close relatives resemble each other as much as we should expect under BM. K < 1 that there is less phylogenetic signal than expected under BM and K > 1 means that there is more. In addition, a significant p-value returned from a randomization test tells us that the phylogenetic signal is significant, in other words, close relatives are more similar than random pairs of taxa in the dataset.
# Run Blomberg's K
K_PH <- phylosig(tree = phyData, # Phylogeny
x = log(plant_height), # trait
method = "K", # method
test = TRUE)
# Print results
print(K_PH)
Phylogenetic signal K : 0.00112205
P-value (based on 1000 randomizations) : 0.005
# Plot results
plot(K_PH)Pagel’s \(\lambda\) is a tree transformation that stretches the tip branches relative to internal branches, making the tree more and more like a complete polytomy of a star phylogeny. If \(\lambda = 0\) there is no phylogenetic signal, while \(\lambda = 1\) correspond to BM and \(0 < \lambda < 1\) in between.
# Run Pagel's Lambda
LB_PH <- phylosig(tree = phyData,
x = log(plant_height),
method = "lambda",
test = TRUE)
# Print the results
print(LB_PH)
Phylogenetic signal lambda : 0.794384
logL(lambda) : -741.291
LR(lambda=0) : 324.333
P-value (based on LR test) : 1.64816e-72
# Plot thre results
plot(LB_PH)Describe the results of phylogenetic signal. Does plant height presents phylogenetic signal?
Brownian Motion is only one model of evolution for a continuous variable. This model assumes that a trait evolves from a starting state (z0) according to a random walk with the variance specified by the rate parameter \(\sigma^{2}\) (sigma-squared). In short, Brownian motion describes a process in which tip states are modeled under the assumption of a multivariate normal distribution. On a phylogeny, the multivariate mean of tip states is equal to the root state estimate, and variance accumulates linearly through time.
Another model is the Ornstein-Uhlenbeck (OU) model, which allows the trait mean to evolve towards a new state (theta), with a selective force (alpha). These two new parameters, plus the starting state (z0) and the rate of evolution (sigsq) parameters from the BM model, make for a 4-parameter model.
The Early Burst (EB) model allows the rate of evolution to change across the tree, where the early rate of evolution is high and declines over time (presumably as niches are filled during an adaptive radiation. The rate of evolution changes exponentially over time and is specified under the model r[t] = r[0] x exp(a x t), where r[0] is the initial rate, a is the rate change parameter, and t is time. The maximum bound is set to -0.000001, representing a decelerating rate of evolution. The minimum bound is set to \(log(10^{-5})\)/depth of the tree.
Let’s evaluate the relative fit of these three models to the Hand-wing index trait.
BMModel <- fitContinuous(phy = phyData, # phylogeny
dat = log(leaf_area), # trait
model = "BM") # evolutionary model
BMModelGEIGER-fitted comparative model of continuous data
fitted 'BM' model parameters:
sigsq = 35.449783
z0 = 6.127122
model summary:
log-likelihood = -2534.051403
AIC = 5072.102805
AICc = 5072.123821
free parameters = 2
Convergence diagnostics:
optimization iterations = 100
failed iterations = 0
number of iterations with same best fit = 100
frequency of best fit = 1.000
object summary:
'lik' -- likelihood function
'bnd' -- bounds for likelihood search
'res' -- optimization iteration summary
'opt' -- maximum likelihood parameter estimates
OUModel <- fitContinuous(phy = phyData, # phylogeny
dat = log(leaf_area), # trait
model = "OU", # evolutionary model
control = list(method = c("subplex")))
OUModelGEIGER-fitted comparative model of continuous data
fitted 'OU' model parameters:
alpha = 2.718282
sigsq = 41.672016
z0 = 6.868176
model summary:
log-likelihood = -1359.134310
AIC = 2724.268621
AICc = 2724.310726
free parameters = 3
Convergence diagnostics:
optimization iterations = 100
failed iterations = 0
number of iterations with same best fit = 3
frequency of best fit = 0.030
object summary:
'lik' -- likelihood function
'bnd' -- bounds for likelihood search
'res' -- optimization iteration summary
'opt' -- maximum likelihood parameter estimates
EBModel <- fitContinuous(phy = phyData, # phylogeny
dat = log(leaf_area), # trait
model = "EB")Warning in fitContinuous(phy = phyData, dat = log(leaf_area), model = "EB"):
Parameter estimates appear at bounds:
a
EBModelGEIGER-fitted comparative model of continuous data
fitted 'EB' model parameters:
a = -0.000001
sigsq = 35.459466
z0 = 6.127103
model summary:
log-likelihood = -2534.055802
AIC = 5074.111603
AICc = 5074.153708
free parameters = 3
Convergence diagnostics:
optimization iterations = 100
failed iterations = 0
number of iterations with same best fit = 10
frequency of best fit = 0.100
object summary:
'lik' -- likelihood function
'bnd' -- bounds for likelihood search
'res' -- optimization iteration summary
'opt' -- maximum likelihood parameter estimates
And recover the parameter values and fit estimates.
BMModel
OUModel
EBModelCompare all models and select the best fitting model. To to that, we will use AIC model comporison approach based on weights.
# Vector of models
EvoMods <- c(BMModel$opt$aicc, OUModel$opt$aicc, EBModel$opt$aicc)
# rename the models
names(EvoMods) <- c("BM", "OU", "EB")
# Run AIC weights
EvoModsComparison <- aicw(EvoMods)
names(EvoModsComparison)[1] <- "AIC"
EvoModsComparison$lnL <- c(brownianModel$opt$lnL, OUModel$opt$lnL, EBModel$opt$lnL)
kable(EvoModsComparison)| AIC | delta | w | lnL | |
|---|---|---|---|---|
| BM | 5072.124 | 2347.813 | 0 | -2534.051 |
| OU | 2724.311 | 0.000 | 1 | -1359.134 |
| EB | 5074.154 | 2349.843 | 0 | -2534.056 |
Make a table with the AIC and lnL values for each model. Which model provides the best fit for leaf area?
Answer: According to the model comparison using AIC, the model that best fits the data is the OU model.
Now, add the results for a model fitting analysis of plant size to this table.
BM_PH <- fitContinuous(phy = phyData, # phylogeny
dat = log(plant_height), # trait
model = "BM")
OU_PH <- fitContinuous(phy = phyData, # phylogeny
dat = log(plant_height), # trait
model = "OU")Warning in fitContinuous(phy = phyData, dat = log(plant_height), model = "OU"):
Parameter estimates appear at bounds:
alpha
EB_PH <- fitContinuous(phy = phyData, # phylogeny
dat = log(plant_height), # trait
model = "EB")Warning in fitContinuous(phy = phyData, dat = log(plant_height), model = "EB"):
Parameter estimates appear at bounds:
a
# Vector of models
mods_PH <- c(BM_PH$opt$aicc, OU_PH$opt$aicc, EB_PH$opt$aicc)
# rename the models
names(mods_PH) <- c("BM", "OU", "EB")
# Run AIC weights
aic_PH <- aicw(mods_PH)
names(aic_PH)[1] <- "AIC"
aic_PH$lnL <- c(BM_PH$opt$lnL, OU_PH$opt$lnL, EB_PH$opt$lnL)
kable(aic_PH)| AIC | delta | w | lnL | |
|---|---|---|---|---|
| BM | 4552.629 | 2211.921 | 0 | -2274.304 |
| OU | 2340.709 | 0.000 | 1 | -1167.333 |
| EB | 4554.659 | 2213.951 | 0 | -2274.309 |
One last thing that I would like to present is the estimation of the metric phylogenetic half-life that can be interpreted as the time it takes for half of the information in the phylogeny to be erased.
# half-life leaf area
log(2) / OUModel$opt$alpha [1] 0.2549946